@monatis monatis commented Oct 1, 2023

This is still WIP

After implementing GGUF support in clip.cpp, it's now time to combine clip.cpp + llama.cpp = llava.cpp (the first model to be supported in this repo).

For now, I copy the CLIP conversion, model loading, and inference code from clip.cpp and make the necessary changes. In the future, these changes may be merged upstream, and clip.cpp may become a submodule in this repo.

  • LLaVA surgery: merge base and LoRA weights, strip the multimodal projector.
  • Convert the LLaMA part with llama.cpp.
  • Update CLIP conversion script to save a LLaVA encoder model in GGUF.
  • Load CLIP vision model with LLaVA projector in clip.cpp.
  • Update the clip_image_encode function to take the image hidden states from layers[-2].
  • Write a simple example for end-to-end LLaVA inference.
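The "LLaVA surgery" step above can be sketched as a key split over the merged checkpoint's state dict: everything under the multimodal projector goes into one file for the CLIP/GGUF side, and the rest stays with the LLaMA weights for llama.cpp's converter. This is a minimal, hypothetical sketch — the key names follow the usual LLaVA convention (`model.mm_projector.*`) and the placeholder strings stand in for real tensors loaded with torch:

```python
# Hypothetical merged LLaVA checkpoint; real code would use
# torch.load(...) and the values would be tensors.
checkpoint = {
    "model.embed_tokens.weight": "...",
    "model.layers.0.self_attn.q_proj.weight": "...",
    "model.mm_projector.weight": "...",
    "model.mm_projector.bias": "...",
}

# Strip the multimodal projector out of the LLaMA part.
projector = {k: v for k, v in checkpoint.items() if "mm_projector" in k}
llama_part = {k: v for k, v in checkpoint.items() if "mm_projector" not in k}
```

The `llama_part` dict can then be saved and fed to llama.cpp's conversion script, while `projector` is bundled with the CLIP vision encoder in the GGUF file.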
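The `layers[-2]` change amounts to selecting the penultimate transformer layer's output instead of the final one, since the LLaVA projector consumes those features. A toy sketch of the selection logic (function name and list-of-layer-outputs representation are illustrative, not the actual clip.cpp API):

```python
# `hidden_states` stands in for the per-layer outputs collected during
# a CLIP vision forward pass, one entry per transformer layer.
def select_llava_features(hidden_states):
    # LLaVA takes the image hidden states from the second-to-last
    # layer, not the final layer's output.
    return hidden_states[-2]

features = select_llava_features(["layer0", "layer1", "layer2", "layer3"])
```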

I think this is enough for the initial release. I will streamline the implementation afterwards.
